Working with SQL Databases in R

Author

Tony Duan

This document explains how to connect to and interact with SQL databases from within R. We will cover writing raw SQL queries, using the tidyverse-friendly dbplyr package, and other best practices.

1. Connecting to a Database

The DBI package provides a universal, consistent interface for connecting to any database. Specific packages like RSQLite, RPostgres, or odbc provide the actual drivers to connect to different database systems.

For these examples, we will use RSQLite, which creates a lightweight, file-based SQL database.

Code
library(DBI)
library(RSQLite)
library(tidyverse)

Create and Populate a Database

First, we connect to a database file (it will be created if it doesn’t exist). Then, we write the mtcars and iris R data frames into the database as new tables.

Code
# Add car names as a column in mtcars
mtcars_df =cbind(car_name = rownames(mtcars), mtcars)

# Create a connection to an SQLite database file
con <- dbConnect(RSQLite::SQLite(), "data/my_sql_database.db")

# Write the R data frames to the database
# `overwrite = TRUE` will replace the table if it already exists
dbWriteTable(con, "mtcars", mtcars_df, overwrite = TRUE)
dbWriteTable(con, "iris", iris, overwrite = TRUE)

We can list the tables to confirm they were created.

Code
dbListTables(con)
[1] "car_parts" "iris"      "mtcars"    "part_info"

2. Method 1: Writing Raw SQL Queries

The dbGetQuery() function sends a SQL statement to the database and returns the result as a data frame.

SELECT Statement

Select all columns from the mtcars table, limiting the result to the first 3 rows.

Code
sql_select <- "SELECT * FROM mtcars LIMIT 3"
dbGetQuery(con, sql_select)
       car_name  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Column Aliasing and Creation

Use AS to rename a column in the output. You can also create new columns based on calculations.

Code
sql_create_col <- "SELECT mpg, mpg + 1 AS mpg_plus_one FROM mtcars"
head(dbGetQuery(con, sql_create_col))
   mpg mpg_plus_one
1 21.0         22.0
2 21.0         22.0
3 22.8         23.8
4 21.4         22.4
5 18.7         19.7
6 18.1         19.1

Filtering with WHERE

The WHERE clause filters rows based on a condition.

Code
sql_filter <- "SELECT * FROM mtcars WHERE hp > 200 AND cyl = 8"
head(dbGetQuery(con, sql_filter))
             car_name  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1          Duster 360 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
2  Cadillac Fleetwood 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
3 Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
4   Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
5          Camaro Z28 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
6      Ford Pantera L 15.8   8  351 264 4.22 3.170 14.50  0  1    5    4

Aggregation with GROUP BY

GROUP BY collapses multiple rows into a single summary row. It’s used with aggregate functions like AVG(), COUNT(), MIN(), and MAX().

This query calculates the average horsepower and row count for each cylinder group.

Code
sql_groupby <- "SELECT cyl, AVG(hp) AS avg_hp, COUNT(*) AS num_cars FROM mtcars GROUP BY cyl"
dbGetQuery(con, sql_groupby)
  cyl    avg_hp num_cars
1   4  82.63636       11
2   6 122.28571        7
3   8 209.21429       14

Sorting with ORDER BY

ORDER BY sorts the result set in ascending (ASC) or descending (DESC) order.

Code
sql_orderby <- "SELECT car_name, mpg, hp FROM mtcars ORDER BY hp DESC LIMIT 5"
dbGetQuery(con, sql_orderby)
           car_name  mpg  hp
1     Maserati Bora 15.0 335
2    Ford Pantera L 15.8 264
3        Duster 360 14.3 245
4        Camaro Z28 13.3 245
5 Chrysler Imperial 14.7 230

Joins

This example creates two tables, car_parts and part_info, to demonstrate joins.

Code
dbWriteTable(con, "car_parts", data.frame(car_name = c("Mazda RX4", "Datsun 710", "Hornet 4 Drive"), part_id = c(1, 2, 3)), overwrite = TRUE)
dbWriteTable(con, "part_info", data.frame(part_id = c(1, 2, 4), part_name = c("Engine", "Tire", "Brake")), overwrite = TRUE)

INNER JOIN

Returns only the rows where the join key (part_id) exists in both tables.

Code
sql_inner_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
INNER JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_inner_join)
    car_name part_name
1  Mazda RX4    Engine
2 Datsun 710      Tire

LEFT JOIN

Returns all rows from the left table (car_parts) and the matched rows from the right table (part_info).

Code
sql_left_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
LEFT JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_left_join)
        car_name part_name
1      Mazda RX4    Engine
2     Datsun 710      Tire
3 Hornet 4 Drive      <NA>

String and Date Functions

SQL has functions for manipulating strings and dates. The exact syntax can vary by database. In SQLite, you can use UPPER() and strftime().

Code
sql_string_date <- "SELECT car_name, UPPER(car_name) AS upper_name, '2023-10-27' AS today from car_parts"
head(dbGetQuery(con, sql_string_date))
        car_name     upper_name      today
1      Mazda RX4      MAZDA RX4 2023-10-27
2     Datsun 710     DATSUN 710 2023-10-27
3 Hornet 4 Drive HORNET 4 DRIVE 2023-10-27

Modifying Data

UPDATE Table

UPDATE modifies existing records in a table.

Code
# Note: dbExecute is used for statements that don't return data
dbExecute(con, "UPDATE mtcars SET hp = 120 WHERE car_name = 'Mazda RX4'")
[1] 1
Code
dbGetQuery(con, "SELECT car_name, hp FROM mtcars WHERE car_name = 'Mazda RX4'")
   car_name  hp
1 Mazda RX4 120

CREATE and DROP Table

CREATE TABLE builds a new table, and DROP TABLE deletes it.

Code
sql_create <- "CREATE TABLE mtcars_copy AS SELECT * FROM mtcars"
dbExecute(con, sql_create)
[1] 0
Code
dbListTables(con)
[1] "car_parts"   "iris"        "mtcars"      "mtcars_copy" "part_info"  
Code
sql_drop <- "DROP TABLE mtcars_copy"
dbExecute(con, sql_drop)
[1] 0
Code
dbListTables(con)
[1] "car_parts" "iris"      "mtcars"    "part_info"

3. Method 2: Using dbplyr

dbplyr is a tidyverse backend for databases. It allows you to write familiar dplyr code, which dbplyr translates into SQL for you. This is the recommended modern approach as it is easier to write and less prone to SQL injection errors.

First, we create a connection to the mtcars table in the database.

Code
mtcars_db <- tbl(con, "mtcars")

Now, we can use dplyr verbs on this object. The code is not run in R; it is translated to SQL and sent to the database.

Code
query <- mtcars_db %>%
  select(car_name, mpg, hp, cyl) %>%
  filter(hp > 150) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(desc(avg_mpg))

We can use show_query() to see the SQL that dbplyr generated.

Code
show_query(query)
<SQL>
SELECT `cyl`, AVG(`mpg`) AS `avg_mpg`
FROM (
  SELECT `car_name`, `mpg`, `hp`, `cyl`
  FROM `mtcars`
  WHERE (`hp` > 150.0)
) AS `q01`
GROUP BY `cyl`
ORDER BY `avg_mpg` DESC

To execute the query and pull the results back into an R data frame, we use collect().

Code
results <- collect(query)
results
# A tibble: 2 × 2
    cyl avg_mpg
  <dbl>   <dbl>
1     6    19.7
2     8    15.1

4. Best Practice: Parameterized Queries

To prevent SQL injection attacks, you should not use paste() to insert values into a SQL query. Instead, use parameterized queries with ? as a placeholder. DBI will safely insert the values for you.

Code
hp_threshold <- 150
sql_safe <- "SELECT car_name, hp FROM mtcars WHERE hp > ?"

# dbGetQuery can take a `params` list
dbGetQuery(con, sql_safe, params = list(hp_threshold)) %>% head()
            car_name  hp
1  Hornet Sportabout 175
2         Duster 360 245
3         Merc 450SE 180
4         Merc 450SL 180
5        Merc 450SLC 180
6 Cadillac Fleetwood 205

5. Method 3: Using sqldf on Local Data Frames

The sqldf package allows you to run SQL queries on local R data frames, not on an external database. This can be useful for users who are more comfortable with SQL than with dplyr syntax for data manipulation.

Code
library(sqldf)

# Note the use of the R data frame `mtcars_df`
sqldf("SELECT car_name, mpg FROM mtcars_df WHERE cyl = 8 LIMIT 5")
           car_name  mpg
1 Hornet Sportabout 18.7
2        Duster 360 14.3
3        Merc 450SE 16.4
4        Merc 450SL 17.3
5       Merc 450SLC 15.2

6. Disconnecting from the Database

It is very important to close the database connection when you are finished to release resources.

Code
dbDisconnect(con)
Back to top